Crime vs. High School Educational Outcomes – Montgomery County, MD

Nick Schaap

CMSC320 | Spring 2022

Background

Growing up in Montgomery County, MD, I have first-hand knowledge of some of the biggest issues faced by my county's school district. MCPS is the 14th largest public school district in the US by enrollment. It is also extremely diverse: its 165,000+ students come from more than 150 countries and speak more than 150 languages (https://www.montgomeryschoolsmd.org/about/). One of the biggest issues faced by MCPS is the achievement gap. Currently, there are large discrepancies in educational outcomes across MCPS's 26 high school clusters, which can largely be attributed to external factors such as income level and the surrounding environment. I wanted to examine specifically the correlation between crime and key school profile metrics such as graduation rate, attendance rate, dropout rate, free and reduced-price meals (FARMS) enrollment, and student-to-staff ratios, and to see whether MCPS is providing more staff support in underprivileged areas.

In [ ]:
import folium
from folium.plugins import HeatMap
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from fuzzywuzzy import fuzz
import geopy.distance

Data Collection

I wanted to collect graduation rates and other profile information for each high school as well as crime data. The crime data dated as far back as mid-2016, and the MCPS at-a-glance school profiles were available through 2021, so I retrieved data from both sources for 2016-2021 and will focus my analysis on that time frame, when data from both sources was available. The profile data is tabulated in PDF form, so I copied the data into an Excel sheet and exported it as a CSV, then used pandas to wrangle and parse the values for each school and year. I also used an open data API to retrieve school latitude and longitude data so that I could filter for crime data near particular schools. I matched the API data to the data retrieved from the MCPS website using a library called fuzzywuzzy: for each school name from the MCPS data, I scored every API row by how closely its school name matched, then took the closest match. This accounted for any small discrepancies in how school names were formatted.
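To illustrate the matching step, here is a dependency-free sketch of the same idea using the standard library's difflib (the notebook itself uses fuzzywuzzy's partial_ratio, which additionally tolerates substring matches; the names below are illustrative examples):

```python
from difflib import SequenceMatcher

# Score each candidate name from the API against the MCPS name and keep the
# best match, so small formatting differences (e.g. a missing period) don't
# prevent a join between the two datasets.
mcps_name = "Thomas S. Wootton HS"            # name as formatted in the MCPS PDF
api_names = ["Thomas S Wootton HS", "Watkins Mill HS", "Wheaton HS"]

best = max(api_names, key=lambda n: SequenceMatcher(None, mcps_name, n).ratio())
print(best)  # the missing period does not prevent the match
```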

MCPS At-A-Glance Data with index matching to Montgomery County School Information API data

In [ ]:
# School profile information

# Read in the csv file
school_data = pd.read_csv('school_data.csv')

# Melt column data into individual row entries
school_data = school_data.melt(var_name="Year", value_name="Value")
for (i, row) in school_data.iterrows():
    values = row['Value'].split()
    columns = ['FARMS', 'ATTENDANCE', 'GRADUATION', 'DROPOUT', 'MOBILITY', 'STUD/STAFF RATIO', 'ENG CLASS SIZE', 'OTHER CLASS SIZE']
    # Parsing individual metric values
    school_data.at[i, 'High School'] = " ".join(values[0:(len(values) - len(columns))])
    for j in range(0, len(columns)):
        school_data.at[i, columns[-(j+1)]] = values[-(j+1)]
school_data = school_data.drop(columns=['Value'])

# Removing <,> quantifiers. Some of the profile information is intentionally masked at upper 
# boundaries for student privacy reasons.
def containsNumber(value):
    # True if the string contains at least one digit
    return any(character.isdigit() for character in value)

# Parse each metric to a float, stripping a leading masking character when present;
# rows with no numeric value are marked -1 and dropped
for column in ['GRADUATION', 'FARMS', 'DROPOUT', 'ATTENDANCE', 'STUD/STAFF RATIO']:
    school_data[column] = school_data[column].apply(lambda x: float(x) if x[0].isdigit() else float(x[1:]) if containsNumber(x) else -1)
    school_data = school_data[school_data[column] > 0]

school_data = school_data.astype({'Year': 'int'})
school_data = school_data[school_data['Year'] >= 2016]


# School geographical information
schools_info = pd.read_json('https://data.montgomerycountymd.gov/resource/7ycz-azby.json')

for (i, row) in school_data.iterrows():
    school_name = row["High School"]
    find_school = schools_info.copy(deep=True)
    # Finding the closest match to the API data using fuzzywuzzy partial_ratio on the school name
    find_school['match_score'] = find_school.apply(lambda x: fuzz.partial_ratio(school_name, x["school_name"]), axis=1)
    # Sorting based on similarity scores
    find_school = find_school.sort_values(by='match_score', ascending=False).head(1).index.values[0]
    # Choosing the row index with the highest similarity as the matching index
    school_data.at[i, "school_info_index"] = find_school
school_data = school_data.astype({'school_info_index':'int'})
school_data
Out[ ]:
Year High School OTHER CLASS SIZE ENG CLASS SIZE STUD/STAFF RATIO MOBILITY DROPOUT GRADUATION ATTENDANCE FARMS school_info_index
0 2021 Bethesda-Chevy Chase HS 17.6 14.6 15.1 7.7 5.0 92.7 93.4 21.7 1
1 2021 Montgomery Blair HS 21.1 20.2 13.3 10.0 8.2 88.2 90.5 51.2 20
2 2021 James Hubert Blake HS 19.0 16.1 13.4 9.7 5.0 92.1 92.3 56.3 23
3 2021 Winston Churchill HS 22.9 19.4 12.3 ≤5.0 5.0 95.0 93.8 8.4 7
4 2021 Clarksburg HS 16.7 17.9 13.2 10.7 5.0 93.2 93.0 44.8 11
... ... ... ... ... ... ... ... ... ... ... ...
151 2016 Springbrook HS 14.3 14.4 11.0 15.2 9.3 85.4 91.5 73.7 22
152 2016 Watkins Mill HS 12.8 11.8 10.1 21.9 12.9 79.3 87.3 81.1 17
153 2016 Wheaton HS 16.1 18.4 12.5 11.4 16.1 77.2 90.9 76.2 24
154 2016 Walt Whitman HS 16.5 17.4 12.8 7.4 5.0 95.0 93.9 5.0 2
155 2016 Thomas S. Wootton HS 19.5 16.9 14.1 ≤5.0 5.0 95.0 94.9 13.7 4

150 rows × 11 columns

Montgomery County High Schools Information API Data

In [ ]:
# Filtering the API columns to those of interest/ value
schools_info = schools_info[['school_name', 'zip_code', 'city', 'address', 'latitude', 'longitude']]
schools_info
Out[ ]:
school_name zip_code city address latitude longitude
0 Walter Johnson HS 20814 Bethesda 6400 Rock Spring Dr 39.025392 -77.130102
1 Bethesda-Chevy Chase HS 20814 Bethesda 4301 East West Hwy 38.986826 -77.088970
2 Walt Whitman HS 20817 Bethesda 7100 Whittier Blv 38.981631 -77.127673
3 Poolesville HS 20837 Poolesville 17501 Willard Rd 39.143103 -77.418780
4 Thomas S Wootton HS 20850 Rockville 2100 Wootton Pkw 39.076582 -77.183197
5 Rockville HS 20851 Rockville 2100 Baltimore Rd 39.086348 -77.118272
6 Richard Montgomery HS 20852 Rockville 250 Richard Montgomery Dr 39.077292 -77.145730
7 Winston Churchill HS 20854 Potomac 11300 Gainsborough Rd 39.044305 -77.173128
8 Col Zadok Magruder HS 20855 Rockville 5939 Muncaster Mill Rd 39.131311 -77.118806
9 Sherwood HS 20860 Sandy Spring 300 Olney Sandy Spring Rd 39.148342 -77.018772
10 Paint Branch HS 20866 Burtonsville 14121 Old Columbia Pik 39.088679 -76.947102
11 Clarksburg HS 20871 Clarksburg 22500 Wims Rd 39.225502 -77.265587
12 Damascus HS 20872 Damascus 25921 Ridge Rd 39.282496 -77.210020
13 Northwest HS 20874 Germantown 13501 Richter Farm Rd 39.151593 -77.279329
14 Seneca Valley HS 20874 Germantown 19401 Crystal Rock Dr 39.175094 -77.264332
15 Gaithersburg HS 20877 Gaithersburg 314 S Frederick Ave 39.134839 -77.195478
16 Quince Orchard HS 20878 Gaithersburg 15800 Quince Orchard Rd 39.115933 -77.254239
17 Watkins Mill HS 20879 Gaithersburg 10301 Apple Ridge Rd 39.183967 -77.215836
18 Albert Einstein HS 20895 Kensington 11135 Newport Mill Rd 39.039616 -77.067036
19 Northwood HS 20901 Silver Spring 919 University Blv W 39.035694 -77.022484
20 Montgomery Blair HS 20901 Silver Spring 51 E University Blv 39.018273 -77.012434
21 John F Kennedy HS 20902 Silver Spring 1901 Randolph Rd 39.065750 -77.039028
22 Springbrook HS 20904 Silver Spring 201 Valley Brook Dr 39.057802 -77.005681
23 James Hubert Blake HS 20905 Silver Spring 300 Norwood Rd 39.113330 -77.017506
24 Wheaton HS 20906 Silver Spring 12601 Dalewood Dr 39.061338 -77.066669
In [ ]:
# Lookup function that returns the API data row corresponding to a particular school
def get_school_info(school_name):
    return schools_info[schools_info["school_name"] == school_name].head(1).reset_index().iloc[0,:]

Montgomery County Crime Data (2016-present)

I obtained crime data from the Montgomery County, Maryland open data archive. This data dates back to 2013 and contains over 294,000 crime records. The API initially limits responses to 1,000 records, so I passed a query parameter to override this default and to order the records by the start_date of the logged crime event. This let me avoid writing my own pagination logic to retrieve the full dataset, at the cost of a large response time (~30 seconds).
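For reference, the pagination this avoids would use Socrata's `$offset` parameter. A sketch of that loop (with `fetch_page` injectable so it can be exercised without the network; the default delegates to `pd.read_json` as above):

```python
import pandas as pd

BASE = "https://data.montgomerycountymd.gov/resource/icn6-v9z3.json"

def fetch_all(page_size=50000, fetch_page=pd.read_json):
    # Page through the dataset with $limit/$offset; a short page signals the end
    pages, offset = [], 0
    while True:
        page = fetch_page(f"{BASE}?$limit={page_size}&$offset={offset}&$order=start_date")
        pages.append(page)
        if len(page) < page_size:
            break
        offset += page_size
    return pd.concat(pages, ignore_index=True)
```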

In [ ]:
crime_data = pd.read_json("https://data.montgomerycountymd.gov/resource/icn6-v9z3.json?$limit=300000&$order=start_date")
crime_data
Out[ ]:
incident_id offence_code case_number date nibrs_code victims crimename1 crimename2 crimename3 district ... address_street street_type start_date latitude longitude police_district_number geolocation end_date street_prefix_dir street_suffix_dir
0 201353560 9199 210046695 2021-11-17 16:35:01 90Z 1 Other All Other Offenses POLICE INFORMATION GERMANTOWN ... GREAT PARK CIR 2016-07-01T00:00:00.000 39.202690 -77.254900 5D {'latitude': '39.2027', 'longitude': '-77.2549... NaN NaN NaN
1 201089117 9105 16035632 NaT 90Z 1 Other All Other Offenses LOST PROPERTY WHEATON ... CUTSTONE WAY 2016-07-01T00:00:00.000 39.096388 -77.028240 4D {'latitude': '39.0964', 'longitude': '-77.0282... NaN NaN NaN
2 201103232 9105 16053356 NaT 90Z 1 Other All Other Offenses LOST PROPERTY SILVER SPRING ... NEW HAMPSHIRE AVE 2016-07-01T00:00:00.000 39.033464 -76.986127 3D {'latitude': '39.0335', 'longitude': '-76.9861... 2016-08-31T00:00:00.000 NaN NaN
3 201102727 9105 16052672 NaT 90Z 1 Other All Other Offenses LOST PROPERTY MONTGOMERY VILLAGE ... ANTARES DR 2016-07-01T00:00:00.000 39.160745 -77.145064 6D {'latitude': '39.1607', 'longitude': '-77.1451... NaN NaN NaN
4 201087611 9108 16033838 NaT 90Z 1 Other All Other Offenses RECOVERED PROPERTY - MONT CO. SILVER SPRING ... SOUTHAMPTON DR 2016-07-01T00:00:00.000 39.006138 -76.983357 3D {'latitude': '39.0061', 'longitude': '-76.9834... 2016-07-01T00:00:00.000 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
294565 201376151 4899 220020555 2022-05-14 12:45:43 90Z 1 Crime Against Society All Other Offenses OBSTRUCT POLICE (DESCRIBE OFFENSE) WHEATON ... GEORGIA AVE 2022-05-14T12:45:00.000 39.088660 -77.079800 4D {'latitude': '39.0887', 'longitude': '-77.0798... NaN NaN NaN
294566 201376159 9113 220020575 2022-05-14 15:29:52 90Z 1 Other All Other Offenses MENTAL ILLNESS - EMERGENCY PETITION MONTGOMERY VILLAGE ... ORCHARD DR 2022-05-14T15:29:00.000 39.065920 -77.173600 6D {'latitude': '39.0659', 'longitude': '-77.1736... NaN NaN NaN
294567 201376170 2305 220020582 2022-05-14 16:28:22 23F 1 Crime Against Property Theft From Motor Vehicle LARCENY - FROM AUTO ROCKVILLE ... GREAT FALLS RD 2022-05-14T16:07:00.000 39.078390 -77.161500 1D {'latitude': '39.0784', 'longitude': '-77.1615... 2022-05-14T16:10:00.000 NaN NaN
294568 201376175 2399 220020594 NaT 23H 1 Crime Against Property All other Larceny LARCENY (DESCRIBE OFFENSE) ROCKVILLE ... HIGGINS PL 2022-05-14T17:51:00.000 39.063410 -77.118800 1D {'latitude': '39.0634', 'longitude': '-77.1188... 2022-05-14T17:53:00.000 NaN NaN
294569 201376183 9199 220020609 NaT 90Z 1 Other All Other Offenses POLICE INFORMATION ROCKVILLE ... VEIRS MILL RD 2022-05-14T20:39:00.000 0.000000 0.000000 1D {'latitude': '0.0', 'longitude': '0.0', 'human... NaN NaN NaN

294570 rows × 30 columns

The Montgomery County Open Data Crime API returns various interesting information about each recorded crime including: nibrs_code, crimename1, crimename2, crimename3, start_date, latitude, and longitude. I plan to use the latitude and longitude information to filter crime events near various schools. The nibrs_code is another interesting feature that allows me to filter the crime data for specific categories of offenses as defined by the National Incident-Based Reporting System standards.

Visualizing Discrepancies in Crime Levels Surrounding Various High Schools

Mapping Crime Data and School Locations

In [ ]:
# Function that uses the folium library to generate a heat map of the various crimes passed 
# as well as labeling the location of each of MCPS high school
def generate_heat_map(crimes, zoom=11, center=[39.1547, -77.2405], include=None):
    map_osm = folium.Map(location=center, zoom_start=zoom, tiles = "Stamen Toner")
    for _, school in schools_info.iterrows():
        if include is not None and school["school_name"] not in include:
            continue
        folium.Marker(location=[school["latitude"], school["longitude"]], tooltip=school["school_name"],
                        icon=folium.Icon(color='red')).add_to(map_osm)
    heat_data = [[row['latitude'],row['longitude']] for _, row in crimes.iterrows()]
    HeatMap(heat_data, radius=15).add_to(map_osm)
    return map_osm

All Crimes

In [ ]:
generate_heat_map(crime_data)
Out[ ]:
(Interactive folium heat map: all crimes county-wide, with MCPS high school markers)

The above heat map shows crime data as well as the locations of MCPS's high schools. Just from looking at this map, no interesting patterns jump out immediately. It seems like crime levels are generally pretty uniformly distributed around each high school. I wanted to take a closer look and specifically look at crimes against people. I also want to zoom in a bit more to inspect a more detailed heat map.

Mapping Crimes Against People

In [ ]:
crimes_against_people = crime_data[crime_data['crimename1'].str.contains("Person", na=False)]
generate_heat_map(crimes_against_people, zoom=13)
Out[ ]:
(Interactive folium heat map: crimes against people, county-wide)

This map returned more meaningful results. Now I could clearly see how some high schools such as Wheaton High School had relatively high levels of crimes against people while other high schools such as Walt Whitman High School saw relatively low levels of crime against people.

Crimes against People in the Area Surrounding Wheaton High School

In [ ]:
center = list(get_school_info("Wheaton HS")[["latitude", "longitude"]])
generate_heat_map(crimes_against_people, zoom=13, center=center, include=["Wheaton HS"])
Out[ ]:
(Interactive folium heat map: crimes against people near Wheaton HS)

Crimes against People in the Area Surrounding Walt Whitman High School

In [ ]:
center = list(get_school_info("Walt Whitman HS")[["latitude", "longitude"]])
generate_heat_map(crimes_against_people, zoom=13, center=center, include=["Walt Whitman HS"])
Out[ ]:
(Interactive folium heat map: crimes against people near Walt Whitman HS)

Visualizing School Profile Information

Now that I have visualized a discrepancy in crime surrounding different high schools around the county, I want to begin connecting the dots and see how crime may be affecting key educational outcomes and school profile details.

In [ ]:
metrics = ['FARMS', 'ATTENDANCE', 'GRADUATION', 'DROPOUT', 'STUD/STAFF RATIO']
for metric in metrics:
    for school in school_data['High School'].unique():
        school_info = school_data[school_data['High School'] == school]
        plt.plot(school_info['Year'].sort_values(), school_info[metric], label=school)
        plt.title(f'{metric.title()} rate vs. time')
        plt.xlabel('Year')
        plt.ylabel(f'{metric.title()} rate')
    plt.show()

Visualizing School Profile Metrics vs. Nearby Crime Rates

I hypothesize that an increase in crime in the area surrounding a high school (within a 5-mile radius) will have a negative impact on all of the school profile metrics under study (graduation rate, attendance rate, dropout rate, and free and reduced-price meals (FARMS) enrollment). In order to test my hypothesis I needed a way to get all of the crimes in the surrounding area of each school, so I created a function that filters the crime data retrieved from the Montgomery County Open Data API to only those crimes within a 5-mile radius of the passed-in school.

I used a library called geopy to calculate the distance between two points defined by their latitude and longitude. Geopy uses the geodesic distance between the two points, which accounts for the curvature of the earth. This made finding the distance between schools and crime incidents straightforward, since both of my datasets provide these coordinates.
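As a sanity check on the distances involved, the great-circle distance that geopy refines can be approximated with a pure-Python haversine. The coordinates below are Wheaton HS and Walt Whitman HS from the schools_info table:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(p1, p2):
    # Great-circle distance between two (latitude, longitude) points, in miles
    lat1, lon1, lat2, lon2 = map(radians, (*p1, *p2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(a))  # mean Earth radius in miles

wheaton = (39.061338, -77.066669)
whitman = (38.981631, -77.127673)
print(round(haversine_miles(wheaton, whitman), 1))  # just over 6 miles, so outside a 5-mile radius
```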

In [ ]:
def get_crimes_near_school(school_name, nibrs_codes = None, mile_radius = 5, crime_list = None):
    # Allow filtering to select only crimes conforming to certain nibrs codes
    if crime_list is None:
        crime_list = crime_data
    if nibrs_codes is not None:
        crimes = crime_list[crime_list['nibrs_code'].str.contains("|".join(nibrs_codes), na=False)]
    else:
        crimes = crime_list
    school = get_school_info(school_name)
    # Extract school latitude and longitude information
    location = tuple(school[["latitude", "longitude"]])
    crimes = crimes.copy(deep=True)
    for (i, row) in crimes.iterrows():
        crime_location = tuple(row[["latitude", "longitude"]])
        # Define a crime as being near a school if it occurred within 5 miles of the school
        crimes.at[i, "near-school"] = geopy.distance.distance(location, crime_location).miles <= mile_radius
    # Return all crimes marked as near the passed in school
    return crimes[crimes["near-school"] == True] 
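One caveat: the row-by-row loop above makes a Python-level geopy call per record, which is slow on a ~300k-row frame. If geodesic precision isn't essential, the same filter can be computed in one vectorized haversine pass (a numpy sketch; `near_mask` is a hypothetical helper, not part of the notebook's pipeline):

```python
import numpy as np

def near_mask(lat0, lon0, lats, lons, mile_radius=5):
    # Vectorized haversine distance (miles) from one point to arrays of points,
    # returned as a boolean mask of points within mile_radius
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    lats = np.radians(np.asarray(lats, dtype=float))
    lons = np.radians(np.asarray(lons, dtype=float))
    a = np.sin((lats - lat0) / 2) ** 2 + np.cos(lat0) * np.cos(lats) * np.sin((lons - lon0) / 2) ** 2
    return 2 * 3958.8 * np.arcsin(np.sqrt(a)) <= mile_radius

# Usage would be roughly:
# crimes[near_mask(*location, crimes["latitude"], crimes["longitude"])]
```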

In order to do some initial plotting, I took the raw time-based school metric data and averaged it over the entire period for which data was collected, giving a characteristic value for each school. I also added a column with the number of crimes committed within a 5-mile radius of the school. To narrow down the number of crimes, I focused specifically on crimes marked as family offenses (NIBRS code 90F). These crimes, which include neglect and domestic violence, involve minors and are likely to have the most profound impact on educational outcomes.

In [ ]:
# Collecting a description of all crimes listed as nibrs code 90F (family offenses) in the dataset
crime_data[crime_data['nibrs_code'] == "90F"]["crimename3"].unique()
Out[ ]:
array(['FAMILY OFFENSE - NEGLECT CHILD (INCLUDES NONSUPPOR',
       'FAMILY OFFENSE (DESCRIBE OFFENSE)',
       'FAMILY OFFENSE - NEGLECT FAMILY',
       'FAMILY OFFENSE - CRUELTY TOWARD CHILD',
       'FAMILY OFFENSE - CRUELTY TOWARD WIFE'], dtype=object)
In [ ]:
# Grouping the school data by High school and the school info index and taking the mean of any columns containing floating point data
average_school_rates = school_data.groupby(by=["High School", "school_info_index"]).mean().reset_index()

for (i, row) in average_school_rates.iterrows():
    # Get the appropriate school name used by the school info dataframe
    school_name = schools_info.iloc[row["school_info_index"]]["school_name"]
    # Get crimes near the school
    crimes = get_crimes_near_school(school_name, nibrs_codes=['90F'])
    # Count the number of crimes
    numCrimes = crimes["incident_id"].count()
    average_school_rates.at[i, "numCrimes"] = numCrimes

average_school_rates = average_school_rates.sort_values(by="numCrimes")
average_school_rates
Out[ ]:
High School school_info_index Year STUD/STAFF RATIO DROPOUT GRADUATION ATTENDANCE FARMS numCrimes
12 Poolesville HS 3 2018.5 14.200000 5.000000 95.000000 94.233333 15.000000 3.0
4 Damascus HS 12 2018.5 12.116667 5.000000 90.700000 92.383333 31.916667 21.0
17 Sherwood HS 9 2018.5 13.050000 5.000000 90.466667 92.800000 29.350000 58.0
20 Walt Whitman HS 2 2018.5 12.833333 5.000000 95.000000 94.000000 5.016667 69.0
11 Paint Branch HS 10 2018.5 12.300000 7.766667 89.533333 94.466667 66.400000 77.0
24 Winston Churchill HS 7 2018.5 12.800000 5.000000 95.000000 93.383333 9.183333 108.0
2 Clarksburg HS 11 2018.5 13.100000 5.000000 92.916667 92.916667 49.900000 149.0
6 James Hubert Blake HS 23 2018.5 13.266667 5.000000 91.766667 92.550000 58.950000 150.0
19 Thomas S. Wootton HS 4 2018.5 14.116667 5.000000 95.000000 94.883333 13.733333 164.0
21 Walter Johnson HS 0 2018.5 14.200000 5.000000 94.150000 93.833333 18.600000 166.0
1 Bethesda-Chevy Chase HS 1 2018.5 15.000000 5.000000 94.550000 93.233333 21.016667 171.0
14 Richard Montgomery HS 6 2018.5 14.266667 5.000000 92.183333 92.583333 37.450000 201.0
8 Montgomery Blair HS 20 2018.5 13.483333 7.766667 86.850000 90.416667 53.533333 213.0
3 Col. Zadok Magruder HS 8 2018.5 11.433333 7.016667 86.683333 90.116667 57.000000 219.0
13 Quince Orchard HS 16 2018.5 12.333333 5.000000 92.266667 91.116667 43.216667 224.0
15 Rockville HS 5 2018.5 9.650000 5.850000 85.616667 92.266667 52.583333 229.0
9 Northwest HS 13 2018.5 13.533333 5.000000 94.850000 92.150000 43.383333 233.0
23 Wheaton HS 24 2018.5 12.516667 14.666667 78.983333 90.766667 75.700000 244.0
7 John F. Kennedy HS 21 2018.5 10.200000 10.100000 81.966667 88.616667 82.950000 246.0
16 Seneca Valley HS 14 2018.5 10.083333 6.833333 84.650000 90.050000 67.483333 257.0
5 Gaithersburg HS 15 2018.5 10.300000 15.666667 75.216667 86.833333 74.750000 261.0
18 Springbrook HS 22 2018.5 11.166667 9.350000 85.016667 91.450000 73.283333 265.0
10 Northwood HS 19 2018.5 10.816667 14.800000 80.550000 87.816667 73.516667 277.0
22 Watkins Mill HS 17 2018.5 10.133333 12.350000 79.866667 87.450000 81.066667 287.0
0 Albert Einstein HS 18 2018.5 10.900000 10.433333 82.200000 89.533333 65.300000 290.0

I created a helper function that plots a given school profile metric against the number of nearby crimes, letting me see the relationship between the two if one exists. I also used numpy to fit and overlay a regression line on the resulting plots.
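One numpy subtlety worth flagging: `np.polynomial.polynomial.polyfit` returns coefficients lowest degree first, while `np.polyval` expects them highest degree first, which is why the helper reverses them with `coef[::-1]`. A minimal check:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1 + 2 * x                                   # exact line y = 1 + 2x
coef = np.polynomial.polynomial.polyfit(x, y, 1)
print(np.round(coef, 6))                        # lowest degree first: [1. 2.]
print(np.polyval(coef[::-1], 4.0))              # 9.0 once the order is reversed
```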

In [ ]:
def plot_school_rates_vs_crime(rate, plt, label_min=None, deg=1, label_max=None, school_labels=None, crime_type="family crimes"):
    # Reset indices so positional lookups line up after the frame was sorted by numCrimes
    x = average_school_rates["numCrimes"].reset_index(drop=True)
    y = average_school_rates[rate].reset_index(drop=True)
    schools = average_school_rates["High School"].reset_index(drop=True)
    coef = np.polynomial.polynomial.polyfit(x, y, deg)
    x_fit = range(int(np.max(x)) + 5)
    y_fit = np.polyval(coef[::-1], x_fit)
    plt.plot(x, y, 'yo', x_fit, y_fit, '--k')
    for i, label in enumerate(schools):
        if (label_min is not None and y[i] <= label_min) or (label_max is not None and y[i] >= label_max) or (school_labels is not None and label in school_labels):
            plt.annotate(label, (x[i], y[i]))
    plt.set_title(f'Average {rate.lower()} rate vs. number of {crime_type} (2016-present)')
    plt.set_xlabel(f'Incidence of {crime_type} (2016-present)')
    plt.set_ylabel(f'Average {rate.lower()} rate')

# Allow all plots to share the same x-axis
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, sharex=True)
axs = (ax1, ax2, ax3, ax4)
fig.set_size_inches(10, 20)
plot_school_rates_vs_crime("GRADUATION", ax1)
plot_school_rates_vs_crime("FARMS", ax2)
plot_school_rates_vs_crime("ATTENDANCE", ax3)
plot_school_rates_vs_crime("DROPOUT", ax4)
for ax in axs:
    ax.label_outer()

Looking at the above plots, it's clear that some correlation exists between each of the observed school metrics and family-crime incidence. Some of these relationships appear linear while others seem to be higher order. For example, average FARMS enrollment seems to increase linearly with crime incidence, while dropout rate grows slowly until around 200 crimes, at which point it begins to increase dramatically. To confirm the relationship between the variables, I want to run a linear regression analysis on my data and observe the p-values for the coefficients of the linear model.

Hypothesis Testing

In [ ]:
# Importing regression model packages
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
In [ ]:
metrics = ["GRADUATION", "DROPOUT", "ATTENDANCE", "FARMS"]
for metric in metrics:
    X = average_school_rates['numCrimes']
    X = np.array(X).reshape(-1, 1)
    y = np.array(average_school_rates[metric])
    # Creating a linear regression model
    reg = LinearRegression().fit(X, y)

    
    X = sm.add_constant(X)
    est = sm.OLS(y, X).fit()
    [const_pvalue, x1_pvalue] = est.pvalues
    [const, x1] = est.params
    print(metric)
    print(f"const: {const} (p-value: {const_pvalue})")
    print(f"x1: {x1} (p-value: {x1_pvalue})")
    print(f"R^2 {est.rsquared}")
    print("-----------------------------------------")
    
GRADUATION
const: 97.216262471562 (p-value: 9.851860857745965e-24)
x1: -0.04788809001652489 (p-value: 0.00022144671062582577)
R^2 0.45413679284973374
-----------------------------------------
DROPOUT
const: 2.858635649482769 (p-value: 0.04981389324036021)
x1: 0.02534572430443709 (p-value: 0.0011957826928370252)
R^2 0.37249967464648315
-----------------------------------------
ATTENDANCE
const: 95.23401307083185 (p-value: 2.9156546200305736e-34)
x1: -0.019860394319248485 (p-value: 2.0144895672493217e-05)
R^2 0.553541496201198
-----------------------------------------
FARMS
const: 10.888316379960187 (p-value: 0.21716782017223787)
x1: 0.20254810646755314 (p-value: 8.656385172374484e-05)
R^2 0.4953177253450447
-----------------------------------------

The p-values for the linear regression fits are below the standard significance level $\alpha = 0.05$ for all coefficients except the constant in the FARMS regression. This makes a good case for using a linear model as a predictor between each school metric and crime incidence in the surrounding area. However, the $R^2$ values sit around 0.5, which is not a particularly strong fit and could mean the true relationship is not completely explained by this single variable, so I also want to try fitting polynomials of higher degree to see if we can obtain a better fit.
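A quick numpy-only way to quantify what the next plots show visually is to compare $R^2$ across fit degrees. In this sketch, toy data stands in for a school metric; with the real data, `x` would be `numCrimes` and `y` a metric column from `average_school_rates`:

```python
import numpy as np

def r_squared(x, y, deg):
    # R^2 of a degree-`deg` polynomial least-squares fit
    resid = y - np.polyval(np.polyfit(x, y, deg), x)
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(0)
x = np.linspace(0, 300, 25)
y = 5 + 0.0002 * x ** 2 + rng.normal(0, 2, x.size)   # mildly quadratic toy metric

# The quadratic fit should recover more variance than the linear one
print(round(r_squared(x, y, 1), 3), round(r_squared(x, y, 2), 3))
```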

In [ ]:
# Plotting fits with degree 2
# Allow all plots to share the same x-axis
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, sharex=True)
axs = (ax1, ax2, ax3, ax4)
fig.set_size_inches(10, 20)
plot_school_rates_vs_crime("GRADUATION", ax1, deg = 2)
plot_school_rates_vs_crime("FARMS", ax2, deg = 2)
plot_school_rates_vs_crime("ATTENDANCE", ax3, deg = 2)
plot_school_rates_vs_crime("DROPOUT", ax4, deg = 2)
for ax in axs:
    ax.label_outer()

The 2nd-degree polynomial fits seem to do a better job of fitting the data, especially at higher crime incidence rates: many of these metrics grow or diminish faster as crime incidence increases. One drawback of the quadratic fit is its concavity at low crime incidence rates; for example, it doesn't make much sense that attendance rate would increase at first.

Investigating MCPS Support

I wanted to see how MCPS may be supporting underprivileged schools by increasing staff support at these schools. To get a sense of this, I will plot student-to-staff ratios vs. crime incidence.

In [ ]:
fig, (ax1) = plt.subplots(1, 1, sharex=True)
fig.set_size_inches(10, 5)
plot_school_rates_vs_crime("STUD/STAFF RATIO", ax1, deg = 1, label_max=14)
In [ ]:
metric = "STUD/STAFF RATIO"
X = average_school_rates['numCrimes']
X = np.array(X).reshape(-1, 1)
y = np.array(average_school_rates[metric])
X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
[const_pvalue, x1_pvalue] = est.pvalues
[const, x1] = est.params
print(metric)
print(f"const: {const} (p-value: {const_pvalue})")
print(f"x1: {x1} (p-value: {x1_pvalue})")
print(f"R^2 {est.rsquared}")
print("-----------------------------------------")
STUD/STAFF RATIO
const: 14.142664460971487 (p-value: 7.27505762524598e-17)
x1: -0.009988348215689048 (p-value: 0.005046956302931085)
R^2 0.29470196054512177
-----------------------------------------

There seems to be a negative correlation between average student/staff ratios and crime incidence rates, meaning MCPS does seem to be supplying more staff support to schools in underprivileged areas. However, there are still some anomalies. For example, Damascus HS has a relatively high student/staff ratio compared to other MCPS high schools with similar crime incidence rates. This could indicate either that Damascus HS is being under-served, or that it is being served in other ways that allow it to perform on par with other MCPS high schools. Let's find out...

In [ ]:
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, sharex=True)
axs = (ax1, ax2, ax3, ax4)
fig.set_size_inches(10, 20)
plot_school_rates_vs_crime("GRADUATION", ax1, deg = 1, school_labels=["Damascus HS"])
plot_school_rates_vs_crime("FARMS", ax2, deg = 1, school_labels=["Damascus HS"])
plot_school_rates_vs_crime("ATTENDANCE", ax3, deg = 1, school_labels=["Damascus HS"])
plot_school_rates_vs_crime("DROPOUT", ax4, deg = 1, school_labels=["Damascus HS"])

Looking at how Damascus HS performs with respect to the key school metrics we chose to explore, it is clear that Damascus HS is still performing on par with the rest of MCPS's high schools: it has positive residuals for graduation and attendance rates and negative residuals for FARMS and dropout rates. From this we can likely conclude that Damascus HS has found ways other than increased staff support to improve its key school metrics and educational outcomes.

Effects of Other Types of Crimes

I want to see whether other types of crimes have the same kinds of effects on school performance metrics.

Plotting school metrics vs. drug crime incidence

In [ ]:
drug_crimes = crime_data[crime_data['nibrs_code'].str.contains("|".join(['35A', '35B']), na=False)]
# Take a random sample of 500 crime records from the full time frame to limit processing time
drug_crimes = drug_crimes.sample(500)

for (i, row) in average_school_rates.iterrows():
    # Get the appropriate school name used by the school info dataframe
    school_name = schools_info.iloc[row["school_info_index"]]["school_name"]
    # Get crimes near the school
    crimes = get_crimes_near_school(school_name, nibrs_codes=['35A', '35B'], crime_list=drug_crimes)
    # Count the number of crimes
    numCrimes = crimes["incident_id"].count()
    average_school_rates.at[i, "numCrimes"] = numCrimes

# Allow all plots to share the same x-axis
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, sharex=True)
axs = (ax1, ax2, ax3, ax4)
fig.set_size_inches(10, 20)
plot_school_rates_vs_crime("GRADUATION", ax1, crime_type="drug-related crimes")
plot_school_rates_vs_crime("FARMS", ax2, crime_type="drug-related crimes")
plot_school_rates_vs_crime("ATTENDANCE", ax3, crime_type="drug-related crimes")
plot_school_rates_vs_crime("DROPOUT", ax4, crime_type="drug-related crimes")
for ax in axs:
    ax.label_outer()

metrics = ["GRADUATION", "DROPOUT", "ATTENDANCE", "FARMS"]
for metric in metrics:
    X = average_school_rates['numCrimes']
    X = np.array(X).reshape(-1, 1)
    y = np.array(average_school_rates[metric])
    # Creating a linear regression model
    reg = LinearRegression().fit(X, y)

    
    X = sm.add_constant(X)
    est = sm.OLS(y, X).fit()
    [const_pvalue, x1_pvalue] = est.pvalues
    [const, x1] = est.params
    print(metric)
    print(f"const: {const} (p-value: {const_pvalue})")
    print(f"x1: {x1} (p-value: {x1_pvalue})")
    print(f"R^2 {est.rsquared}")
    print("-----------------------------------------")
GRADUATION
const: 95.70541870579058 (p-value: 6.454832038647178e-24)
x1: -0.055687349574319644 (p-value: 0.0008642395404792349)
R^2 0.38900857389102295
-----------------------------------------
DROPOUT
const: 3.4410724685063903 (p-value: 0.01295908240952867)
x1: 0.031138316458412106 (p-value: 0.001637887286991133)
R^2 0.3561403332909897
-----------------------------------------
ATTENDANCE
const: 94.32903440031873 (p-value: 1.613920451122321e-33)
x1: -0.020961330474545684 (p-value: 0.0008373562202330881)
R^2 0.39059489204728903
-----------------------------------------
FARMS
const: 20.93790471714498 (p-value: 0.029739509894774006)
x1: 0.20749102250297646 (p-value: 0.0027075056077280747)
R^2 0.32926118480400224
-----------------------------------------
In [ ]:
fraud_crimes = crime_data[crime_data['nibrs_code'].str.contains("|".join(['26']), na=False)]
fraud_crimes = fraud_crimes.sample(500)

for (i, row) in average_school_rates.iterrows():
    # Get the appropriate school name used by the school info dataframe
    school_name = schools_info.iloc[row["school_info_index"]]["school_name"]
    # Get crimes near the school
    crimes = get_crimes_near_school(school_name, nibrs_codes=['26'], crime_list=fraud_crimes)
    # Count the number of crimes
    numCrimes = crimes["incident_id"].count()
    average_school_rates.at[i, "numCrimes"] = numCrimes

# Allow all plots to share the same x-axis
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, sharex=True)
axs = (ax1, ax2, ax3, ax4)
fig.set_size_inches(10, 20)
plot_school_rates_vs_crime("GRADUATION", ax1, crime_type="fraud-related crimes")
plot_school_rates_vs_crime("FARMS", ax2, crime_type="fraud-related crimes")
plot_school_rates_vs_crime("ATTENDANCE", ax3, crime_type="fraud-related crimes")
plot_school_rates_vs_crime("DROPOUT", ax4, crime_type="fraud-related crimes")
for ax in axs:
    ax.label_outer()

metrics = ["GRADUATION", "DROPOUT", "ATTENDANCE", "FARMS"]
for metric in metrics:
    X = average_school_rates['numCrimes']
    X = np.array(X).reshape(-1, 1)
    y = np.array(average_school_rates[metric])
    # Fit an OLS regression with an intercept term
    X = sm.add_constant(X)
    est = sm.OLS(y, X).fit()
    [const_pvalue, x1_pvalue] = est.pvalues
    [const, x1] = est.params
    print(metric)
    print(f"const: {const} (p-value: {const_pvalue})")
    print(f"x1: {x1} (p-value: {x1_pvalue})")
    print(f"R^2 {est.rsquared}")
    print("-----------------------------------------")
GRADUATION
const: 94.52437830026768 (p-value: 1.6657912711198184e-20)
x1: -0.05267525075254798 (p-value: 0.03751969345130798)
R^2 0.17484211504151403
-----------------------------------------
DROPOUT
const: 3.6174463337246205 (p-value: 0.04456770565145466)
x1: 0.03364398949338107 (p-value: 0.02163900445477452)
R^2 0.20884911206472156
-----------------------------------------
ATTENDANCE
const: 94.10152088345478 (p-value: 2.2880911765392373e-30)
x1: -0.021706378838770406 (p-value: 0.021096007440834022)
R^2 0.21040224885526615
-----------------------------------------
FARMS
const: 30.181250992743763 (p-value: 0.025588708956657425)
x1: 0.15434628064914802 (p-value: 0.1415943463603024)
R^2 0.09152105191364557
-----------------------------------------

Drug-related and fraud-related crimes appear to affect the key school performance metrics in the same directions as family offenses, though to varying degrees. For example, the coefficients of determination (R^2) for fraud-related crimes hover around 0.1 to 0.2, while those for drug-related crimes hover around 0.35. Both are lower than the R^2 values observed for family offenses.

Results

From my exploratory data analysis I pulled the following observations/results:

  • Crime incidence varies widely across Montgomery County, and some high school clusters are disproportionately exposed to higher crime rates in the surrounding area (within a 5-mile radius).
  • A higher incidence of family offenses near an MCPS high school is associated with lower graduation and attendance rates.
  • A higher incidence of family offenses near an MCPS high school is associated with higher FARMS enrollment and dropout rates.
  • MCPS has generally employed greater staff support at underprivileged high schools.
  • Different types of crimes affect school performance metrics at varying levels.

While MCPS has deployed greater support at underprivileged high schools, these trends persist. Incidence of family offenses appears to have a significant association with key educational outcomes. The question remains whether MCPS can do more to help these students succeed while the odds are stacked against them, or whether this is a systemic issue that will persist as long as these types of crimes continue to occur.